Temporally coherent 4D reconstruction of complex dynamic scenes
This paper presents an approach for reconstruction of 4D temporally coherent
models of complex dynamic scenes. No prior knowledge of scene structure or
camera calibration is required, allowing reconstruction from multiple moving
cameras. Sparse-to-dense temporal correspondence is integrated with joint
multi-view segmentation and reconstruction to obtain a complete 4D
representation of static and dynamic objects. Temporal coherence is exploited
to overcome visual ambiguities, resulting in improved reconstruction of complex
scenes. Robust joint segmentation and reconstruction of dynamic objects is
achieved by introducing a geodesic star convexity constraint. Comparative
evaluation is performed on a variety of unstructured indoor and outdoor dynamic
scenes with hand-held cameras and multiple people. This demonstrates
reconstruction of complete temporally coherent 4D scene models with improved
non-rigid object segmentation and shape reconstruction.
Comment: To appear in the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2016. Video available at: https://www.youtube.com/watch?v=bm_P13_-Ds
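A minimal sketch of the star convexity idea behind the geodesic star convexity constraint mentioned above, assuming a binary foreground mask and precomputed geodesic paths as inputs; the function name, data layout, and check-based formulation are illustrative (in an energy-minimisation framework the constraint would instead be encoded as infinite-cost pairwise terms along each path), not the paper's implementation:

    def violates_star_convexity(mask, paths):
        """Check a binary segmentation against star convexity.

        mask  : 2D boolean array, True where a pixel is foreground.
        paths : dict mapping a pixel (y, x) to the sequence of pixels on
                its precomputed geodesic path to the star centre.
        A labelling is star convex when every foreground pixel's path
        back to the centre is itself entirely foreground.
        """
        for pixel, path in paths.items():
            if mask[pixel] and not all(mask[q] for q in path):
                return True  # this pixel's path leaves the object
        return False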
General Dynamic Scene Reconstruction from Multiple View Video
This paper introduces a general approach to dynamic scene reconstruction from
multiple moving cameras without prior knowledge or limiting constraints on the
scene structure, appearance, or illumination. Existing techniques for dynamic
scene reconstruction from multiple wide-baseline camera views primarily focus
on accurate reconstruction in controlled environments, where the cameras are
fixed and calibrated and the background is known. These approaches are not robust
for general dynamic scenes captured with sparse moving cameras. Previous
approaches for outdoor dynamic scene reconstruction assume prior knowledge of
the static background appearance and structure. The primary contributions of
this paper are twofold: an automatic method for initial coarse dynamic scene
segmentation and reconstruction without prior knowledge of background
appearance or structure; and a general robust approach for joint segmentation
refinement and dense reconstruction of dynamic scenes from multiple
wide-baseline static or moving cameras. Evaluation is performed on a variety of
indoor and outdoor scenes with cluttered backgrounds and multiple dynamic
non-rigid objects such as people. Comparison with state-of-the-art approaches
demonstrates improved accuracy in both multiple view segmentation and dense
reconstruction. The proposed approach also eliminates the requirement for prior
knowledge of scene structure and appearance.
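As a loose illustration of how an initial coarse dynamic/static separation can be obtained without background priors, the hypothetical sketch below fits a single global motion model to sparse feature matches between frames; matches that disagree with the dominant (camera-induced) motion become candidate dynamic regions. This is a sketch under assumed inputs, not the paper's algorithm:

    import cv2
    import numpy as np

    def coarse_dynamic_points(pts_prev, pts_curr, thresh=3.0):
        """Split sparse feature matches into static/dynamic candidates.

        pts_prev, pts_curr : (N, 2) float32 matched keypoints in two frames.
        A homography fitted with RANSAC absorbs the dominant motion
        (camera plus static background); outliers are returned as points
        likely to belong to moving objects.
        """
        _, inlier_mask = cv2.findHomography(pts_prev, pts_curr,
                                            cv2.RANSAC, thresh)
        inliers = inlier_mask.ravel().astype(bool)
        return pts_curr[~inliers]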
U4D: Unsupervised 4D Dynamic Scene Understanding
We introduce the first approach to solve the challenging problem of
unsupervised 4D visual scene understanding for complex dynamic scenes with
multiple interacting people from multi-view video. Our approach simultaneously
estimates a detailed model that includes a per-pixel semantically and
temporally coherent reconstruction, together with instance-level segmentation
exploiting photo-consistency, semantic and motion information. We further
leverage recent advances in 3D pose estimation to constrain the joint semantic
instance segmentation and 4D temporally coherent reconstruction. This enables
per-person semantic instance segmentation of multiple interacting people in
complex dynamic scenes. Extensive evaluation of the joint visual scene
understanding framework against state-of-the-art methods on challenging indoor
and outdoor sequences demonstrates a significant (approx. 40%) improvement in
semantic segmentation, reconstruction and scene flow accuracy.
Comment: To appear in the IEEE International Conference on Computer Vision (ICCV) 201
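Purely as an illustration of the joint formulation described above, a labelling cost of the hedged form below captures the flavour of weighting photo-consistency, semantic, motion and pose cues together; the symbols, weights, and function shape are ours, not the paper's:

    def joint_energy(labels, e_photo, e_sem, e_motion, e_pose,
                     weights=(1.0, 0.5, 0.5, 0.25)):
        """Illustrative weighted sum of per-cue costs for a candidate
        labelling; each e_* maps a labelling to a scalar cost, and lower
        total energy means a more consistent joint estimate."""
        terms = (e_photo, e_sem, e_motion, e_pose)
        return sum(w * e(labels) for w, e in zip(weights, terms))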
A-Subclass ATP-Binding Cassette Proteins in Brain Lipid Homeostasis and Neurodegeneration
The A-subclass of ATP-binding cassette (ABC) transporters comprises 12 structurally related members of the evolutionarily highly conserved superfamily of ABC transporters. ABCA transporters represent a subgroup of “full-size” multispan transporters, of which several members have been shown to mediate the transport of a variety of physiologic lipid compounds across membrane barriers. The importance of ABCA transporters in human disease is documented by the observation that so far four members of this protein family (ABCA1, ABCA3, ABCA4, ABCA12) have been causatively linked to monogenetic disorders including familial high-density lipoprotein deficiency, neonatal surfactant deficiency, degenerative retinopathies, and congenital keratinization disorders. Recent research also points to a significant contribution of several A-subfamily ABC transporters to neurodegenerative diseases, in particular Alzheimer’s disease (AD). This review will give a summary of our current knowledge of the A-subclass of ABC transporters, with a special focus on brain lipid homeostasis and their involvement in AD.
UPGPT: Universal Diffusion Model for Person Image Generation, Editing and Pose Transfer
Existing person image generative models can do either image generation or
pose transfer, but not both. We propose a unified diffusion model, UPGPT, to
provide a universal solution covering all the person image tasks: generation,
pose transfer, and editing. With multimodal conditioning and disentanglement
capabilities, our approach offers fine-grained control over the generation
and editing of images using a combination of pose, text, and image, all
without needing a semantic segmentation mask, which can be challenging to
obtain or edit. We also pioneer the use of the parametric SMPL body model in
pose-guided person image generation, demonstrating a new capability:
simultaneous pose and camera-view interpolation while maintaining a person's
appearance. Results on the benchmark DeepFashion dataset show that UPGPT sets
the new state of the art while pioneering new capabilities of editing and
pose transfer in human image generation.
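To picture the simultaneous pose and camera-view interpolation described above, a minimal sketch assuming SMPL poses given as (24, 3) per-joint axis-angle arrays and a simple azimuth camera parameter; the helpers use SciPy's rotation utilities, and the names and surrounding conditioning pipeline are hypothetical:

    import numpy as np
    from scipy.spatial.transform import Rotation, Slerp

    def interp_smpl_pose(pose_a, pose_b, t):
        """Interpolate two SMPL poses ((24, 3) axis-angle) at t in [0, 1].

        Per-joint quaternion slerp avoids the artefacts of linearly
        interpolating axis-angle vectors directly.
        """
        out = np.empty_like(pose_a)
        for j in range(pose_a.shape[0]):
            key_rots = Rotation.from_rotvec(np.stack([pose_a[j], pose_b[j]]))
            out[j] = Slerp([0.0, 1.0], key_rots)(t).as_rotvec()
        return out

    def interp_camera_azimuth(az_a, az_b, t):
        """Linearly interpolate a camera azimuth (degrees) for a view sweep."""
        return (1.0 - t) * az_a + t * az_b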
Multi-person Implicit Reconstruction from a Single Image
We present a new end-to-end learning framework to obtain detailed and spatially coherent reconstructions of multiple people from a single image. Existing multi-person methods suffer from two main drawbacks: they are often model-based and therefore cannot capture accurate 3D models of people with loose clothing and hair, or they require manual intervention to resolve occlusions or interactions. Our method addresses both limitations by introducing the first end-to-end learning approach to perform model-free implicit reconstruction for realistic 3D capture of multiple clothed people in arbitrary poses (with occlusions) from a single image. Our network simultaneously estimates the 3D geometry of each person and their 6DOF spatial locations, to obtain a coherent multi-human reconstruction. In addition, we introduce a new synthetic dataset that depicts images with a varying number of inter-occluded humans and a variety of clothing and hair styles. We demonstrate robust, high-resolution reconstructions on images of multiple humans with complex occlusions, loose clothing and a large variety of poses and scenes. Our quantitative evaluation on both synthetic and real-world datasets demonstrates state-of-the-art performance with significant improvements in the accuracy and completeness of the reconstructions over competing approaches.
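To make the combination of per-person implicit geometry and estimated 6DOF locations concrete, the hedged sketch below composites per-person occupancy functions into one scene-level query: each world point is mapped into a person's canonical frame by the inverse of that person's rigid transform before the implicit function is evaluated. All names and the (R, t) convention are illustrative, not the paper's interface:

    import numpy as np

    def scene_occupancy(points_world, people):
        """Composite occupancy of multiple reconstructed people.

        points_world : (N, 3) query points in the world frame.
        people       : list of (occupancy_fn, R, t), where occupancy_fn
                       maps canonical-frame points to values in [0, 1]
                       and (R, t) is the person's 6DOF world pose.
        """
        occ = np.zeros(len(points_world))
        for occupancy_fn, R, t in people:
            canonical = (points_world - t) @ R  # per row: R.T @ (p - t)
            occ = np.maximum(occ, occupancy_fn(canonical))  # union over people
        return occ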
4D Temporally Coherent Light-field Video
Light-field video has recently been used in virtual and augmented reality
applications to increase realism and immersion. However, existing light-field
methods are generally limited to static scenes due to the requirement to
acquire a dense scene representation. The large amount of data and the absence
of methods to infer temporal coherence pose major challenges in storage,
compression and editing compared to conventional video. In this paper, we
propose the first method to extract a spatio-temporally coherent light-field
video representation. A novel method to obtain Epipolar Plane Images (EPIs)
from a sparse light-field camera array is proposed. EPIs are used to constrain
scene flow estimation to obtain 4D temporally coherent representations of
dynamic light-fields. Temporal coherence is achieved on a variety of
light-field datasets. Evaluation of the proposed light-field scene flow against
existing multi-view dense correspondence approaches demonstrates a significant
improvement in the accuracy of temporal coherence.
Comment: Published in 3D Vision (3DV) 201
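Independently of the paper's method for sparse arrays, a minimal sketch of what an Epipolar Plane Image is: stacking the same scan-line from a rectified, equally spaced line of views, so that each scene point traces a line whose slope encodes its disparity (inverse depth). The input layout is an assumption:

    import numpy as np

    def epi_from_views(views, row):
        """Build an EPI from a horizontal line of rectified cameras.

        views : (N, H, W, 3) array of images from N equally spaced,
                rectified viewpoints.
        row   : scan-line index shared by all views.
        Returns an (N, W, 3) slice in which a scene point appears as a
        line whose slope is proportional to its disparity.
        """
        return np.stack([v[row] for v in views], axis=0)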
Semantically Coherent Co-Segmentation and Reconstruction of Dynamic Scenes
In this paper, we propose a framework for spatially and temporally coherent semantic co-segmentation and reconstruction of complex dynamic scenes from multiple static or moving cameras. Semantic co-segmentation exploits the coherence in semantic class labels both spatially, between views at a single time instant, and temporally, between widely spaced time instants of dynamic objects with similar shape and appearance. We demonstrate that semantic coherence results in improved segmentation and reconstruction for complex scenes. A joint formulation is proposed for semantically coherent object-based co-segmentation and reconstruction of scenes by enforcing consistent semantic labelling between views and over time. Semantic tracklets are introduced to enforce temporal coherence in semantic labelling and reconstruction between widely spaced instances of dynamic objects. Tracklets of dynamic objects enable unsupervised learning of appearance and shape priors that are exploited in joint segmentation and reconstruction. Evaluation on challenging indoor and outdoor sequences with hand-held moving cameras shows improved accuracy in segmentation, temporally coherent semantic labelling and 3D reconstruction of dynamic scenes.
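The semantic tracklets above can be pictured, very loosely, as linking instance detections across widely spaced frames whenever both their semantic label and their appearance agree; the greedy scheme, similarity function, and threshold below are our own placeholders, not the paper's formulation:

    def link_tracklets(instances, appearance_sim, tau=0.7):
        """Greedily link instance detections into semantic tracklets.

        instances      : list of (frame, label, descriptor) tuples.
        appearance_sim : function scoring two descriptors in [0, 1].
        A detection extends a tracklet only when the semantic labels
        match and appearance similarity exceeds tau, which allows links
        between widely spaced time instants.
        """
        tracklets = []
        for frame, label, desc in sorted(instances, key=lambda x: x[0]):
            for track in tracklets:
                _, t_label, t_desc = track[-1]
                if label == t_label and appearance_sim(desc, t_desc) > tau:
                    track.append((frame, label, desc))
                    break
            else:
                tracklets.append([(frame, label, desc)])
        return tracklets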
SEM-POS: Grammatically and Semantically Correct Video Captioning
Generating grammatically and semantically correct captions in video
captioning is a challenging task. The captions generated by existing
methods are either produced word-by-word, failing to align with grammatical
structure, or miss key information from the input videos. To address these issues, we
introduce a novel global-local fusion network, with a Global-Local Fusion Block
(GLFB) that encodes and fuses features from different parts of speech (POS)
components with visual-spatial features. We use novel combinations of different
POS components - 'determinant + subject', 'auxiliary verb', 'verb', and
'determinant + object' for supervision of the POS blocks - Det + Subject, Aux
Verb, Verb, and Det + Object respectively. The novel global-local fusion
network together with POS blocks helps align the visual features with language
description to generate grammatically and semantically correct captions.
Extensive qualitative and quantitative experiments on benchmark MSVD and MSRVTT
datasets demonstrate that the proposed approach generates more grammatically
and semantically correct captions compared to the existing methods, achieving
the new state-of-the-art. Ablations on the POS blocks and the GLFB demonstrate
the impact of these contributions on the proposed method.
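To make the POS-component supervision concrete, a hedged sketch of how the four supervision targets could be derived from a reference caption with an off-the-shelf tagger; the grouping rules are our simplification (not the paper's exact procedure) and assume spaCy's standard small English model:

    import spacy

    nlp = spacy.load("en_core_web_sm")

    def pos_components(caption):
        """Split a caption into four rough POS supervision groups.

        Determiners are attached to the noun they modify, and grouping
        is keyed off spaCy's dependency and POS tags.
        """
        doc = nlp(caption)
        groups = {"det_subject": [], "aux_verb": [],
                  "verb": [], "det_object": []}
        for tok in doc:
            dets = [c.text for c in tok.children if c.dep_ == "det"]
            if tok.dep_ in ("nsubj", "nsubjpass"):
                groups["det_subject"] += dets + [tok.text]
            elif tok.pos_ == "AUX":
                groups["aux_verb"].append(tok.text)
            elif tok.pos_ == "VERB":
                groups["verb"].append(tok.text)
            elif tok.dep_ in ("dobj", "pobj"):
                groups["det_object"] += dets + [tok.text]
        return groups

For a caption such as "a man is playing the guitar", this simplified grouping yields ['a', 'man'], ['is'], ['playing'], and ['the', 'guitar'] respectively.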